
    Joint model-based recognition and localization of overlapped acoustic events using a set of distributed small microphone arrays

    In the analysis of acoustic scenes, the occurring sounds often have to be detected in time, recognized, and localized in space. Usually, each of these tasks is done separately. In this paper, a model-based approach that carries them out jointly for the case of multiple simultaneous sources is presented and tested. The recognized event classes and their respective room positions are obtained with a single system that maximizes the combination of a large set of scores, each resulting from a different acoustic event model and a different beamformer output signal, where each beamformer output comes from one of several arbitrarily located small microphone arrays. Using a two-step method, experimental work is reported for a specific scenario consisting of meeting-room acoustic events, either isolated or overlapped with speech. Tests carried out with two datasets show the advantage of the proposed approach over some usual techniques, and that the inclusion of estimated priors brings a further performance improvement.
    Comment: computational acoustic scene analysis, microphone array signal processing, acoustic event detection
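    Read literally, the joint decision described above can be written as a MAP search over event classes and room positions, with the per-array, per-beamformer model scores combined additively in the log domain. The following formulation is an interpretation of the abstract rather than the paper's own notation (A arrays, y_a(x) the output of array a's beamformer steered toward position x, lambda_c the model of event class c, and P(c, x) the optionally estimated prior):

    (\hat{c}, \hat{x}) = \arg\max_{c,\,x} \left[ \sum_{a=1}^{A} \log p\big(y_a(x) \mid \lambda_c\big) + \log P(c, x) \right]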

    Acoustic event detection and localization using distributed microphone arrays

    Automatic acoustic scene analysis is a complex task that involves several functionalities: detection (time), localization (space), separation, recognition, etc. This thesis focuses on both acoustic event detection (AED) and acoustic source localization (ASL) when several sources may be simultaneously present in a room. In particular, the experimental work is carried out in a meeting-room scenario. Unlike previous works that either employed models of all possible sound combinations or additionally used video signals, in this thesis the time-overlapping sound problem is tackled by exploiting the signal diversity that results from the use of multiple microphone-array beamformers.
    The core of this thesis is a rather computationally efficient approach that consists of three processing stages. In the first, a set of null-steering beamformers is used to carry out diverse partial signal separations, using multiple arbitrarily located linear microphone arrays, each composed of a small number of microphones. In the second stage, each beamformer output goes through a classification step, which uses models of all the targeted sound classes (HMM-GMM in the experiments). In the third stage, the classifier scores, whether intra- or inter-array, are combined using a probabilistic criterion (such as MAP) or a machine-learning fusion technique (the fuzzy integral (FI) in the experiments).
    This processing scheme is applied in the thesis to a set of problems of increasing complexity, which are defined by the assumptions made regarding the identities (plus time endpoints) and/or positions of the sounds. In fact, the thesis starts with the problem of unambiguously mapping the identities to the positions, continues with AED (positions assumed) and ASL (identities assumed), and ends with the integration of AED and ASL in a single system, which does not need any assumption about identities or positions.
    The evaluation experiments are carried out in a meeting-room scenario where two sources overlap in time; one of them is always speech and the other is an acoustic event from a pre-defined set. Two different databases are used: one produced by merging signals actually recorded in the UPC department's smart-room, and another consisting of overlapping sound signals directly recorded in the same room in a rather spontaneous way. From the experimental results with a single array, it can be observed that the proposed detection system performs better than either the model-based system or a blind-source-separation-based system. Moreover, the product-rule-based combination and the FI-based fusion of the scores from the multiple arrays improve the accuracies further. On the other hand, the posterior position assignment is performed with a very small error rate. Regarding ASL, and assuming an accurate AED system output, the one-source localization performance of the proposed system is slightly better than that of the widely used SRP-PHAT system working in an event-based mode, and it performs significantly better than the latter in the more complex two-source scenario. Finally, though the joint system suffers a slight degradation in classification accuracy with respect to the case where the source positions are known, it shows the advantage of carrying out the two tasks, recognition and localization, with a single system, and it allows the inclusion of information about the prior probabilities of the source positions.
    It is also worth noticing that, although the acoustic scenario used for experimentation is rather limited, the approach and its formalism were developed for a general case, where the number and identities of the sources are not constrained.
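    The first processing stage above relies on null-steering beamformers built from only a few microphones per array. As a rough, generic illustration of that idea (a narrowband minimum-norm design, not the frequency-invariant beamformers actually used in the thesis; the array geometry, frequency and angles below are made-up values):

import numpy as np

def steering_vector(mic_positions, theta, freq, c=343.0):
    # Narrowband steering vector of a linear array for a far-field
    # source at angle theta (radians, broadside = 0).
    delays = mic_positions * np.sin(theta) / c
    return np.exp(-2j * np.pi * freq * delays)

def null_steering_weights(mic_positions, theta_target, theta_null, freq):
    # Minimum-norm weights satisfying C^H w = g: unit gain toward the
    # target direction and a spatial null toward the interferer (LCMV-style).
    C = np.column_stack([
        steering_vector(mic_positions, theta_target, freq),
        steering_vector(mic_positions, theta_null, freq),
    ])
    g = np.array([1.0, 0.0])
    return C @ np.linalg.solve(C.conj().T @ C, g)

# Example: 3-microphone array with 10 cm spacing, target at +20 degrees,
# null toward an interferer at -40 degrees, evaluated at 1 kHz.
mics = np.array([-0.10, 0.0, 0.10])
w = null_steering_weights(mics, np.deg2rad(20.0), np.deg2rad(-40.0), freq=1000.0)
print(abs(np.vdot(w, steering_vector(mics, np.deg2rad(20.0), 1000.0))))   # ~1.0
print(abs(np.vdot(w, steering_vector(mics, np.deg2rad(-40.0), 1000.0))))  # ~0.0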

    Knowledge-based Framework for Intelligent Emotion Recognition in Spontaneous Speech

    Automatic speech emotion recognition plays an important role in intelligent human-computer interaction. Identifying emotion in natural, day-to-day, spontaneous conversational speech is difficult because the emotions expressed by the speaker are often not as prominent as in acted speech. In this paper, we propose a novel spontaneous speech emotion recognition framework that makes use of the available knowledge. The framework is motivated by the observation that there is significant disagreement amongst human annotators when they annotate spontaneous speech, and that the disagreement is largely reduced when they are provided with additional knowledge related to the conversation. The proposed framework makes use of contexts (derived from the linguistic content) and of knowledge about the time lapse of the spoken utterances within an audio call to reliably recognize the current emotion of the speaker in spontaneous audio conversations. Our experimental results demonstrate that there is a significant improvement in the performance of spontaneous speech emotion recognition using the proposed framework.
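    The abstract does not spell out how the knowledge is injected. One simple, commonly used way to fold contextual knowledge into an acoustic emotion classifier is to re-weight the acoustic posterior with a context-derived prior under a conditional-independence assumption; the sketch below illustrates only that generic idea, not the authors' framework, and the label set and probabilities are invented:

import numpy as np

EMOTIONS = ["neutral", "happy", "angry", "sad"]   # hypothetical label set

def combine_with_context(acoustic_posterior, context_prior, base_prior=None):
    # Assuming audio and context are conditionally independent given the
    # emotion e:  P(e | audio, context) is proportional to
    # P(e | audio) * P(e | context) / P(e).
    acoustic_posterior = np.asarray(acoustic_posterior, dtype=float)
    context_prior = np.asarray(context_prior, dtype=float)
    if base_prior is None:
        base_prior = np.full_like(acoustic_posterior, 1.0 / acoustic_posterior.size)
    combined = acoustic_posterior * context_prior / base_prior
    return combined / combined.sum()

# Hypothetical example: the audio alone is ambiguous between "happy" and
# "angry", but the linguistic context and call history favour "angry".
p_audio = [0.10, 0.40, 0.42, 0.08]
p_context = [0.20, 0.10, 0.60, 0.10]
print(dict(zip(EMOTIONS, np.round(combine_with_context(p_audio, p_context), 3))))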

    Source ambiguity resolution of overlapped sounds in a multi-microphone room environment

    When several acoustic sources are simultaneously active in a meeting-room scenario, and both the positions of the sources and the identities of the time-overlapped sound classes have been estimated, the problem of assigning each source position to one of the sound classes still remains. This problem is found in the real-time system implemented in our smart-room, where it is assumed that up to two acoustic events may overlap in time and that the source positions are relatively well separated in space. The position assignment system proposed in this work is based on the fusion of model-based log-likelihood ratios obtained after carrying out several different partial source separations in parallel. To perform the separation, frequency-invariant null-steering beamformers, which can work with a small number of microphones, are used. The experimental results using all six microphone arrays deployed in the room show a high assignment rate in our particular scenario.
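    Conceptually, the assignment step amounts to choosing, among the possible class-to-position mappings, the one with the highest fused model score. The sketch below assumes the per-position, per-class log-likelihoods have already been computed on the partially separated (null-steered) signals and fused across arrays; the class names and score values are illustrative, not taken from the paper:

import numpy as np
from itertools import permutations

def assign_positions(loglik):
    # loglik[k][c]: fused model log-likelihood of sound class c measured on the
    # beamformer output steered to position k (with the other position nulled).
    # Returns the class -> position mapping with the largest total score.
    loglik = np.asarray(loglik, dtype=float)
    n = loglik.shape[0]
    return max(permutations(range(n)),
               key=lambda perm: sum(loglik[perm[c], c] for c in range(n)))

# Illustrative numbers for two positions and two classes (speech, keyboard):
scores = [[-120.0, -150.0],   # the signal from position 0 fits the speech model
          [-160.0, -110.0]]   # the signal from position 1 fits the keyboard model
print(assign_positions(scores))   # (0, 1): speech -> position 0, keyboard -> position 1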

    TCS-ILAB - MediaEval 2015: Affective Impact of Movies and Violent Scene Detection

    ABSTRACT: This paper describes the participation of TCS-ILAB in the MediaEval 2015 Affective Impact of Movies Task (which includes Violent Scene Detection). We propose to detect the affective impact and the violent content of video clips using two different classification methodologies, i.e. a Bayesian Network approach and an Artificial Neural Network approach. Experiments with different combinations of features make up the five run submissions.

    SYSTEM DESCRIPTION
    Bayesian network based valence, arousal and violence detection. We describe the use of a Bayesian network (BN) for the detection of violence/non-violence and of induced affect. Here, we learn the relationships between attributes of different types of features using a BN. Individual attributes such as colorfulness, shot length, or zero-crossing rate form the nodes of the BN; the valence, arousal and violence labels are included as categorical attributes. The primary objective of the BN-based approach is to discover cause-effect relationships between attributes, which are otherwise difficult to learn with other learning methods. This analysis helps in gaining knowledge of the internal processes of feature generation with respect to the labels in question, i.e. violence, valence and arousal. In this work, we use a publicly available Bayesian network learner [1], which gives us the network structure describing the dependencies between attributes. Using the discovered structure, we compute the conditional probabilities for the root and its cause attributes. Further, we perform inference of the valence, arousal and violence values for new observations using the junction-tree algorithm supported in the Dlib-ml library. Conditional probability computation is a relatively simple task for a network with few nodes, which is the case for the image features. However, as the attribute set grows, the number of parameters, namely the conditional probability tables, grows exponentially. Since our major focus is on determining the violence, valence and arousal values with respect to the unknown values of the other features, we apply the D-separation principle.

    Artificial neural network based valence, arousal and violence detection. This section describes the system that uses Artificial Neural Networks (ANN) for classification. Two different methodologies are employed for the two subtasks. For both subtasks, the developed systems extract features from the video shots (including the audio) prior to classification.

    Feature extraction. The proposed system uses different sets of features, either from the feature set (audio, video, and image) provided with the MediaEval dataset, or from our own set of extracted audio features. The designed system uses audio, image and video features either separately or in combination. The audio features are extracted with the openSMILE toolkit [4] from the audio tracks of the video shots. openSMILE computes low-level descriptors (LLDs), followed by statistical functionals, to extract a meaningful and informative set of audio features. The feature set contains the following LLDs: intensity, loudness, 12 MFCCs, pitch (F0), voicing probability, F0 envelope, 8 LSFs (line spectral frequencies), and zero-crossing rate. Delta regression coefficients are computed from these LLDs, and the following functionals are applied to the LLDs and the delta coefficients: maximum and minimum value and their relative positions within the input, range, arithmetic mean, two linear regression coefficients with the linear and quadratic errors, standard deviation, skewness, kurtosis, quartiles, and three inter-quartile ranges. In two different configurations, openSMILE allows the extraction of 988 or 384 audio features (the latter set was earlier used for the Interspeech 2009 Emotion Challenge [5]). Both sets are reduced to a lower dimension by feature selection.

    Classification. For classification, we use an ANN trained with the development-set samples available for each subtask. As data imbalance exists for the violence detection task (only 4.4% of the samples are violent), we take equal numbers of samples from both classes for training.
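    As a minimal illustration of the class balancing mentioned above (assumed here to be random undersampling of the majority class; the authors' exact procedure is not detailed):

import numpy as np

def balanced_indices(labels, seed=0):
    # Randomly undersample the majority class so that both classes
    # contribute the same number of training samples (binary labels 0/1).
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n = min(pos.size, neg.size)
    keep = np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
    rng.shuffle(keep)
    return keep

# Roughly 4.4% positive (violent) labels, as in the task description.
labels = (np.random.default_rng(1).random(10000) < 0.044).astype(int)
idx = balanced_indices(labels)
print(idx.size, labels[idx].mean())   # balanced subset -> mean is 0.5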

    Joint recognition and direction-of-arrival estimation of simultaneous meeting-room acoustic events

    Acoustic scene analysis usually requires several sub-systems working in parallel to carry out the various required functionalities. Moving toward a more integrated approach, in this paper we present an attempt to jointly recognize and localize several simultaneous acoustic events that take place in a meeting-room environment, by developing a computationally efficient technique that employs multiple arbitrarily located small microphone arrays. Assuming a set of simultaneous sounds, a matrix is computed for each array whose elements are likelihoods over the set of classes and a set of discretized directions of arrival. MAP estimation is used to decide both the recognized events and the estimated directions. Experimental results with two sources, one of which is speech, and two three-microphone linear arrays are reported. The recognition results compare favorably with those obtained by assuming that the positions are known.
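    To make the likelihood-matrix idea concrete, the sketch below fuses per-array (class x DOA) log-likelihood matrices by summation and searches all two-source hypotheses for the MAP decision. The uniform prior, the constraint that the two classes differ, and the class/DOA values are assumptions made for illustration, not details taken from the paper:

import numpy as np
from itertools import combinations, product

def map_joint_decision(per_array_loglik, classes, doas):
    # per_array_loglik: one (n_classes x n_doas) matrix of log-likelihoods per
    # microphone array; entry (c, d) scores "class c active from direction d".
    # Fuse the arrays by summing log-likelihoods (product rule), then search
    # all hypotheses of two distinct classes with their directions of arrival.
    total = np.sum(per_array_loglik, axis=0)
    best_score, best_pair = -np.inf, None
    for c1, c2 in combinations(range(len(classes)), 2):
        for d1, d2 in product(range(len(doas)), repeat=2):
            score = total[c1, d1] + total[c2, d2]
            if score > best_score:
                best_score = score
                best_pair = ((classes[c1], doas[d1]), (classes[c2], doas[d2]))
    return best_pair, best_score

# Tiny made-up example: 2 arrays, 3 classes, a DOA grid of 4 angles (degrees).
rng = np.random.default_rng(2)
classes = ["speech", "door_slam", "keyboard"]
doas = [0, 30, 60, 90]
matrices = [rng.normal(size=(len(classes), len(doas))) for _ in range(2)]
print(map_joint_decision(matrices, classes, doas))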

    Sound-model-based acoustic source localization using distributed microphone arrays

    Acoustic source localization and sound recognition are common acoustic scene analysis tasks that are usually considered separately. In this paper, a new source localization technique is proposed that works jointly with an acoustic event detection system. Given the identities and the end-points of simultaneous sounds, the proposed technique uses the statistical models of those sounds to compute a likelihood score for each model and for each signal at the output of a set of null-steering beamformers per microphone array. Those scores are subsequently combined to find the MAP-optimal event source positions in the room. Experimental work is reported for a scenario consisting of meeting-room acoustic events, either isolated or overlapped with speech. From the localization results, which are compared with those from the SRP-PHAT technique, it seems that the proposed model-based approach can be an alternative to current techniques for event-based localization. © 2014 IEEE.
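    A coarse way to picture this model-based localization is a grid search over candidate room positions, scoring each hypothesis with the known event models on the corresponding beamformer outputs and summing over arrays. The sketch below keeps the search generic by taking the scoring function as an argument; the dummy scorer, event names and grid are purely illustrative and do not reflect the paper's beamforming or models:

import numpy as np
from itertools import product

def localize_known_events(score_fn, arrays, event_ids, grid):
    # Grid search over candidate positions for events whose identities are
    # already known.  score_fn(array, event_id, position) should return the
    # log-likelihood of that event's model on the output of a beamformer of
    # that array steered to the candidate position; it is user-supplied here
    # so the search itself stays self-contained.
    best_score, best_positions = -np.inf, None
    for positions in product(grid, repeat=len(event_ids)):
        score = sum(score_fn(a, e, p)
                    for a in arrays
                    for e, p in zip(event_ids, positions))
        if score > best_score:
            best_score, best_positions = score, positions
    return dict(zip(event_ids, best_positions)), best_score

# Dummy scorer: pretends "speech" is near (1, 2) m and "cough" near (3, 1) m.
truth = {"speech": (1.0, 2.0), "cough": (3.0, 1.0)}
def dummy_score(array, event, pos):
    return -np.hypot(pos[0] - truth[event][0], pos[1] - truth[event][1])

grid = [(x * 0.5, y * 0.5) for x in range(9) for y in range(7)]
print(localize_known_events(dummy_score, arrays=[0, 1],
                            event_ids=["speech", "cough"], grid=grid))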